Project Home

Load packages used to generate figures on this page:

# General data wrangling
library(tidyverse)
library(knitr)
library(kableExtra)
library(DT)
library(lubridate)
library(readxl)
library(fakeR)

# Visualization
library(plotly)
library(forcats)
library(ggsignif)

# ggseg is used to visualize the brain
# remotes::install_github("LCBC-UiO/ggseg")
# If that doesn't work: 
# download.file("https://github.com/LCBC-UiO/ggseg/archive/master.zip", "ggseg.zip")
# unzip("ggseg.zip")
# devtools::install_local("ggseg-master")
library(ggseg)

# remotes::install_github("LCBC-UiO/ggseg3d")
library(ggseg3d)

# remotes::install_github("LCBC-UiO/ggsegExtra")
library(ggsegExtra)

Tau-PET Data

Data overview

The longitudinal tau-PET dataset was downloaded as a CSV from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) Study Data repository located at Study Data/Imaging/PET Image Analysis/UC Berkeley - AV1451 Analysis [ADNI2,3] (version: 5/12/2020). This CSV file contains 1,121 rows and 241 columns. Note:ADNI data is freely accessible to all registered users. Please see my Acknowledgments page for more information about ADNI and its contributors.

Data loading

Load partial volume corrected regional tau-PET data, as downloaded from ADNI:

tau.df <- read.csv("../../ADNI_Data/Raw_Data/UCBERKELEYAV1451_PVC_05_12_20.csv")
tau.df$EXAMDATE = as.Date(tau.df$EXAMDATE, format="%m/%d/%Y")

# update stamp is irrelevant, drop it
tau.df <- select(tau.df, -update_stamp)

Note: for those who wish to replicate this analysis but do not have access to ADNI data, this code will create the “Simulated_ADNI_TauPET.csv” file in the GitHub repo main directory:

set.seed(127)
# use simulate_dataset from fakeR
# stealth.level=2 means each column is simulated independently
tau.sim <- simulate_dataset(tau.df, digits=6, stealth.level=2, ignore=c("RID", "VISCODE", "VISCODE2", "EXAMDATE")) 
write.csv(tau.sim, "../Simulated_ADNI_TauPET.csv", row.names = F)

The simulated tau-PET dataset can be loaded as follows:

tau.df <- read.csv("../Simulated_ADNI_TauPET.csv")

Each row in the CSV represents one tau-PET scan (see str call below). Some subjects had repeated scans separated by approximately one year, while other subjects had only one scan. Columns include subject information including anonymized subject ID, visit code, and PET exam date. The other columns encode regional volume and tau-PET uptake. Specifically, there are 123 distinct cortical and subcortical regions of interest (ROIs), each of which has a volume field (in mm^3) and a tau-PET uptake field, called the Standardized Uptake Value Ratio (SUVR).

str(tau.df)
## 'data.frame':    1120 obs. of  164 variables:
##  $ RID                                   : int  21 31 31 56 56 56 59 69 69 69 ...
##  $ VISCODE                               : chr  "init" "init" "y1" "init" ...
##  $ VISCODE2                              : chr  "m144" "m144" "m156" "m144" ...
##  $ EXAMDATE                              : Date, format: "2018-02-02" "2018-04-24" "2019-04-23" "2018-02-20" ...
##  $ INFERIOR_CEREBGM_SUVR                 : num  1 1 1 1 2 2 2 1 2 1 ...
##  $ INFERIOR_CEREBGM_VOLUME               : num  59419 61509 62175 71314 65004 ...
##  $ HEMIWM_SUVR                           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ HEMIWM_VOLUME                         : num  349579 457144 320848 437909 417507 ...
##  $ BRAAK12_SUVR                          : num  2 2 1 1 2 2 2 2 2 2 ...
##  $ BRAAK12_VOLUME                        : num  11818 11919 10717 10675 11082 ...
##  $ BRAAK34_SUVR                          : num  2 2 2 2 2 2 2 2 1 2 ...
##  $ BRAAK34_VOLUME                        : num  111540 132325 116813 106945 123959 ...
##  $ BRAAK56_SUVR                          : num  1 2 2 2 2 1 2 1 1 2 ...
##  $ BRAAK56_VOLUME                        : num  372503 256383 276486 315106 267700 ...
##  $ BRAIN_STEM_SUVR                       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BRAIN_STEM_VOLUME                     : num  21606 22221 18379 20161 18646 ...
##  $ LEFT_MIDDLEFR_SUVR                    : num  2 2 2 4 2 2 2 2 1 1 ...
##  $ LEFT_MIDDLEFR_VOLUME                  : num  16386 23981 16270 19763 19691 ...
##  $ LEFT_ORBITOFR_SUVR                    : num  2 2 2 2 2 1 2 2 2 2 ...
##  $ LEFT_ORBITOFR_VOLUME                  : num  11926 11894 12090 10949 10433 ...
##  $ LEFT_PARSFR_SUVR                      : num  2 2 1 1 2 2 2 2 2 2 ...
##  $ LEFT_PARSFR_VOLUME                    : num  10766 9358 9674 8846 12440 ...
##  $ LEFT_ACCUMBENS_AREA_SUVR              : num  1 2 3 2 2 2 1 3 1 1 ...
##  $ LEFT_ACCUMBENS_AREA_VOLUME            : num  495 419 302 485 547 393 362 314 437 766 ...
##  $ LEFT_AMYGDALA_SUVR                    : num  2 2 1 1 1 2 2 2 2 2 ...
##  $ LEFT_AMYGDALA_VOLUME                  : num  964 884 992 1436 2000 ...
##  $ LEFT_CAUDATE_SUVR                     : num  2 2 2 1 2 2 2 2 1 1 ...
##  $ LEFT_CAUDATE_VOLUME                   : num  2695 4033 3606 3118 4572 ...
##  $ LEFT_HIPPOCAMPUS_SUVR                 : num  2 2 2 2 2 2 2 2 2 1 ...
##  $ LEFT_HIPPOCAMPUS_VOLUME               : num  4165 5164 5350 3687 3876 ...
##  $ LEFT_PALLIDUM_SUVR                    : num  2 3 2 2 3 2 2 2 2 3 ...
##  $ LEFT_PALLIDUM_VOLUME                  : num  1516 1252 1427 1019 828 ...
##  $ LEFT_PUTAMEN_SUVR                     : num  2 1 2 2 1 1 2 2 1 2 ...
##  $ LEFT_PUTAMEN_VOLUME                   : num  4256 5294 5061 4643 5507 ...
##  $ LEFT_THALAMUS_PROPER_SUVR             : num  1 1 2 1 1 1 1 1 1 1 ...
##  $ LEFT_THALAMUS_PROPER_VOLUME           : num  6569 8386 7847 6493 6414 ...
##  $ RIGHT_MIDDLEFR_SUVR                   : num  3 2 3 1 2 2 2 1 3 1 ...
##  $ RIGHT_MIDDLEFR_VOLUME                 : num  21241 22253 16509 19637 19844 ...
##  $ RIGHT_ORBITOFR_SUVR                   : num  3 2 2 2 3 2 2 2 2 2 ...
##  $ RIGHT_ORBITOFR_VOLUME                 : num  10913 12948 11937 12336 13689 ...
##  $ RIGHT_PARSFR_SUVR                     : num  2 2 2 2 2 2 2 1 1 2 ...
##  $ RIGHT_PARSFR_VOLUME                   : num  9044 7451 9729 6696 9493 ...
##  $ RIGHT_ACCUMBENS_AREA_SUVR             : num  1 2 2 1 2 2 2 2 2 1 ...
##  $ RIGHT_ACCUMBENS_AREA_VOLUME           : num  499 368 564 539 547 251 596 630 299 577 ...
##  $ RIGHT_AMYGDALA_SUVR                   : num  3 2 2 1 2 2 3 1 1 2 ...
##  $ RIGHT_AMYGDALA_VOLUME                 : num  1609 1920 1471 1833 1766 ...
##  $ RIGHT_CAUDATE_SUVR                    : num  2 1 2 2 1 2 2 2 2 1 ...
##  $ RIGHT_CAUDATE_VOLUME                  : num  3431 2791 4132 4108 3909 ...
##  $ RIGHT_HIPPOCAMPUS_SUVR                : num  2 2 1 2 1 1 2 1 2 2 ...
##  $ RIGHT_HIPPOCAMPUS_VOLUME              : num  4735 2997 4092 3966 3441 ...
##  $ RIGHT_PALLIDUM_SUVR                   : num  2 2 2 1 2 2 2 2 2 2 ...
##  $ RIGHT_PALLIDUM_VOLUME                 : num  1310 1777 1514 1456 1131 ...
##  $ RIGHT_PUTAMEN_SUVR                    : num  1 2 2 2 2 2 1 1 2 2 ...
##  $ RIGHT_PUTAMEN_VOLUME                  : num  3964 4335 3813 4173 3828 ...
##  $ RIGHT_THALAMUS_PROPER_SUVR            : num  1 1 1 1 2 2 1 1 1 2 ...
##  $ RIGHT_THALAMUS_PROPER_VOLUME          : num  6070 5340 6230 6432 6233 ...
##  $ CHOROID_SUVR                          : num  5 3 4 4 4 4 4 2 3 3 ...
##  $ CHOROID_VOLUME                        : num  4079 4221 3265 4180 3362 ...
##  $ CTX_LH_BANKSSTS_SUVR                  : num  2 1 3 3 2 2 2 2 3 1 ...
##  $ CTX_LH_BANKSSTS_VOLUME                : num  2148 1464 2267 2278 2458 ...
##  $ CTX_LH_CAUDALANTERIORCINGULATE_SUVR   : num  2 1 2 2 2 2 1 2 2 2 ...
##  $ CTX_LH_CAUDALANTERIORCINGULATE_VOLUME : num  1007 1377 1697 1863 2056 ...
##  $ CTX_LH_CUNEUS_SUVR                    : num  2 3 2 2 3 2 1 2 2 2 ...
##  $ CTX_LH_CUNEUS_VOLUME                  : num  1852 3014 2724 3101 2450 ...
##  $ CTX_LH_ENTORHINAL_SUVR                : num  2 1 1 3 4 2 2 4 1 2 ...
##  $ CTX_LH_ENTORHINAL_VOLUME              : num  1600 1154 1668 976 1450 ...
##  $ CTX_LH_FUSIFORM_SUVR                  : num  2 2 1 3 1 2 2 1 1 2 ...
##  $ CTX_LH_FUSIFORM_VOLUME                : num  7041 9366 9118 7329 7996 ...
##  $ CTX_LH_INFERIORPARIETAL_SUVR          : num  2 1 3 1 1 1 4 2 3 2 ...
##  $ CTX_LH_INFERIORPARIETAL_VOLUME        : num  10272 11016 8772 11600 14050 ...
##  $ CTX_LH_INFERIORTEMPORAL_SUVR          : num  1 1 1 2 2 4 2 2 2 2 ...
##  $ CTX_LH_INFERIORTEMPORAL_VOLUME        : num  9421 7822 14548 11770 11636 ...
##  $ CTX_LH_INSULA_SUVR                    : num  2 2 1 2 1 2 2 1 1 2 ...
##  $ CTX_LH_INSULA_VOLUME                  : num  6394 6402 6226 6749 5976 ...
##  $ CTX_LH_ISTHMUSCINGULATE_SUVR          : num  2 2 1 3 2 2 1 1 2 4 ...
##  $ CTX_LH_ISTHMUSCINGULATE_VOLUME        : num  1796 1437 1742 2513 3020 ...
##  $ CTX_LH_LATERALOCCIPITAL_SUVR          : num  3 2 2 3 1 1 2 3 1 3 ...
##  $ CTX_LH_LATERALOCCIPITAL_VOLUME        : num  9831 11003 9581 9683 10864 ...
##  $ CTX_LH_LINGUAL_SUVR                   : num  2 1 2 1 2 2 1 2 2 1 ...
##  $ CTX_LH_LINGUAL_VOLUME                 : num  7159 7026 6278 5887 6189 ...
##  $ CTX_LH_MIDDLETEMPORAL_SUVR            : num  3 3 2 2 3 2 4 3 1 1 ...
##  $ CTX_LH_MIDDLETEMPORAL_VOLUME          : num  7930 10601 10646 9957 8423 ...
##  $ CTX_LH_PARACENTRAL_SUVR               : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ CTX_LH_PARACENTRAL_VOLUME             : num  3052 3226 3088 3529 3935 ...
##  $ CTX_LH_PARAHIPPOCAMPAL_SUVR           : num  2 3 2 1 2 2 1 3 2 1 ...
##  $ CTX_LH_PARAHIPPOCAMPAL_VOLUME         : num  2013 1810 1594 1850 1898 ...
##  $ CTX_LH_PERICALCARINE_SUVR             : num  2 2 1 2 2 2 1 2 2 2 ...
##  $ CTX_LH_PERICALCARINE_VOLUME           : num  1561 2159 2461 1494 2111 ...
##  $ CTX_LH_POSTCENTRAL_SUVR               : num  1 2 2 1 2 3 2 2 1 2 ...
##  $ CTX_LH_POSTCENTRAL_VOLUME             : num  7973 7213 7235 8763 10897 ...
##  $ CTX_LH_POSTERIORCINGULATE_SUVR        : num  2 2 1 1 2 1 2 2 2 2 ...
##  $ CTX_LH_POSTERIORCINGULATE_VOLUME      : num  2582 3673 3770 2343 2015 ...
##  $ CTX_LH_PRECENTRAL_SUVR                : num  2 1 1 2 2 2 2 2 1 1 ...
##  $ CTX_LH_PRECENTRAL_VOLUME              : num  11538 11307 12060 11164 11742 ...
##  $ CTX_LH_PRECUNEUS_SUVR                 : num  3 3 2 4 2 2 3 2 -1 3 ...
##  $ CTX_LH_PRECUNEUS_VOLUME               : num  8157 7833 8414 9032 8882 ...
##  $ CTX_LH_ROSTRALANTERIORCINGULATE_SUVR  : num  2 2 2 1 1 2 2 1 1 2 ...
##  $ CTX_LH_ROSTRALANTERIORCINGULATE_VOLUME: num  2448 2019 2531 1977 2126 ...
##  $ CTX_LH_SUPERIORFRONTAL_SUVR           : num  1 2 2 2 2 2 2 2 2 2 ...
##   [list output truncated]

The SUVR value is normalized to the tau-PET uptake in the inferior cerebellum gray matter (highlighted in blue below), a commonly-used region for tau normalization given the lack of inferior cerebellar tau pathology in Alzheimer’s Disease.

aseg_3d %>% 
  unnest(ggseg_3d) %>% 
  ungroup() %>% 
  select(region) %>% 
  na.omit() %>% 
  mutate(val=ifelse(region %in% c("Right-Cerebellum-Cortex", "Left-Cerebellum-Cortex"), 1, 0)) %>%
  ggseg3d(atlas=aseg_3d, label="region", text="val", colour="val", na.alpha=0.5, 
           palette=c("transparent", "deepskyblue3"), show.legend=F) %>%
  add_glassbrain() %>%
  pan_camera("left lateral") %>%
  remove_axes()

All cortical and subcortical ROIs were delineated by first co-registering the tau-PET image to a high-resolution structural T1-weighted MPRAGE acquired in the same imaging session, and then applying FreeSurfer (v5.3) for automated regional segmentation and parcellation. The following diagram from Marcoux et al. (2018) summarizes this process:

Furthermore, to mitigate issues with lower voxel resolution in PET imaging, partial volume correction was applied to use probabilistic tissue segmentation maps to refine individual ROIs. Note: these PET processing steps were all performed by Susan Landau, Deniz Korman, and William Jagust at the Helen Wills Neuroscience Institute, UC Berkeley and Lawrence Berkeley National Laboratory.

18F-AV-1451 is a relatively recent PET tracer, and was only incorporated into the ADNI-3 pipeline beginning in 2016. I am curious about the temporal distribution of the FreeSurfer-analyzed scans here:

tau.df %>%
  # RID = unique subject identifier, EXAMDATE=date of PET scan
  select(RID, EXAMDATE) %>%
  # Create interactive plotly histogram, which auto-formats date along x-axis
  plot_ly(x=~EXAMDATE, type="histogram",
          marker = list(color = "lightsteelblue",
                        line = list(color = "lightslategray",
                                    width = 1.5))) %>%
  layout(title = 'Tau-PET Scan Date Distribution',
         xaxis = list(title = 'Scan Date',
                      zeroline = TRUE),
         yaxis = list(title = 'Number of PET Scans')) 

Even though ADNI3 officially began in 2016, most scans were acquired from mid-2017 to present day. This doesn’t affect this analysis, though, since the same tau-PET imaging protocol has been used across sites since 2016.

Data distribution

Since this project will explore tau-PET measurements over time, I will be refining the dataset to only subjects with multiple tau-PET scans. Here’s the overall distribution of number of longitudinal scans by subject:

# Plot number of PET scans by subject in a bar chart
p_num_long <- tau.df %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_scans=n()) %>%
  ggplot(., aes(x=fct_reorder(RID, n_scans, .desc=T), y=n_scans)) +
  geom_bar(stat="identity", aes(fill=n_scans, color=n_scans)) +
  labs(fill="Count", color="Count") +
  ggtitle("Number of Longitudinal PET Scans per Subject") +
  ylab("Number of PET Scans") +
  xlab("Subject") +
  theme(axis.text.x=element_blank(),
        plot.title=element_text(hjust=0.5)) 

# Convert to interactive plotly chart
ggplotly(p_num_long)
rm(p_num_long)

The majority of subjects only had one tau-PET scan; given the longitudinal nature of this project, such subjects will eventually be omitted from analysis. On that note, it’s important to know exactly how many subjects do have at least two tau-PET scans:

# Calculate number of subjects with 2+ PET scans
num_scans <- tau.df %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  # Tabulate # scans by subject and filter
  summarise(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  ungroup() %>%
  summarise(num_subjects=n(),
            total_scans=sum(n_scans))
# Print results
cat("Number of subjects with at least two scans: **", 
    num_scans$num_subjects, "**\n", 
    "\nNumber of total PET scans: **", 
    num_scans$total_scans, "**\n", sep="")

Number of subjects with at least two scans: 249

Number of total PET scans: 593

So, we have 249 subjects with two or more scans.

Temporal distribution

Another important consideration is the length of time between each consecutive scan. I will eventually normalize changes in tau-PET to number of years passed to yield an annual rate of change, but it’s good to know what that time interval is:

# Plot the # years in between each pair of consecutive tau-PET scans
p_pet_interval <- tau.df %>%
  select(RID, EXAMDATE) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  # Calculate difference between pairs of scan dates using lag
  mutate(Years_between_Scans = 
           as.numeric((EXAMDATE - lag(EXAMDATE, 
                                      default = EXAMDATE[1]))/365)) %>%
  # Omit the first scan, for which the time interval is zero
  filter(Years_between_Scans>0) %>%
  ggplot(., aes(x=Years_between_Scans)) +
  geom_histogram(stat="count", color="lightslategray") +
  ggtitle("Years in between Tau-PET Scans per Subject") +
  ylab("Frequency") +
  xlab("# Years between two consecutive scans for a subject") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5)) 

# Convert to interactive plotly histogram
ggplotly(p_pet_interval)
rm(p_pet_interval)

There’s a cluster of scans around the one-year mark. Presumably, ADNI3 participants are enrolled in an annual tau-PET plan, though in some cases scans aren’t at precise yearly intervals.

Data missingness

I’ll check if there are any missing data points for tau-PET SUVR values for any of the FreeSurfer-derived cortical or subcortical regions of interest (ROIs). Note: this is filtered to show only subjects with at least two scans:

# Calculate number and proportion of missing data records for each measured region of interest (ROI)
tau.df %>%
  select(-VISCODE, -VISCODE2) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  ungroup() %>%
  # filter only to the SUVR columns
  select(!matches("VOLUME")) %>%
  # Reshape from wide --> long
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = str_replace(ROI, "_SUVR", "")) %>%
  group_by(ROI) %>%
  # Calculate number and proportion of missing data points for each ROI
  summarise(`Percent Missing` = sum(is.na(SUVR))/n(),
            `Number Missing` = sum(is.na(SUVR))) %>%
  datatable()

All regions have zero missing data points, so no imputation will be necessary.

SUVR Distribution by Region

Now, I’ll check the distribution of tau-PET uptake values across the ROIs.

# plot the mean tau-PET SUVR for each region of interest
p_roi_suvr <- tau.df %>%
  # omit irrelevant variables
  select(-VISCODE, -VISCODE2) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  # only look at SUVR columns; reshape wide --> long
  select(!matches("VOLUME")) %>%
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = tolower(str_replace(ROI, "_SUVR", ""))) %>%
  group_by(ROI) %>%
  # calculate mean and SD, along with ymin and ymax, to plot error bars for each ROI
  summarise(Mean_SUVR=mean(SUVR, na.rm=T),
            SD_SUVR = sd(SUVR, na.rm=T),
            ymin = Mean_SUVR-SD_SUVR,
            ymax = Mean_SUVR+SD_SUVR) %>%
  # fct_reorder --> arrange ROI by its average tau-PET SUVR
  ggplot(data=., mapping=aes(x=fct_reorder(ROI, Mean_SUVR, .desc=F), 
                             y=Mean_SUVR,
                             label = ROI)) +
  geom_bar(stat="identity", show.legend=F, fill="lightsteelblue") +
  # Add error bars
  geom_errorbar(aes(ymin=ymin, ymax=ymax), width=0, color="lightslategray") +
  coord_flip() +
  theme_minimal() +
  ylab("Mean Tau-PET SUVR") +
  xlab("Region of Interest") +
  ggtitle("Mean Tau-PET SUVR by ROI") +
  theme(axis.text.x=element_text(size=8, angle=45))

# Convert to interactive plotly graph
ggplotly(p_roi_suvr, height=1000, width=600, tooltip=c("label", "y"))
rm(p_roi_suvr)

ROI Normalization

These values are supposed to be normalized to the inferior cerebellum gray matter, indicated by INFERIOR_CEREBGM_SUVR. To confirm, I’ll check the distribution of INFERIOR_CEREBGM_SUVR values.

# Plot distribution of inferior cerebellum gray tau-PET SUVR
p_inf_cb <- tau.df %>%
  select(-VISCODE, -VISCODE2) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  # Select only SUVR columns for ROIs; pivot wide --> long
  select(!matches("VOLUME")) %>%
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = str_replace(ROI, "_SUVR", "")) %>%
  # Filter to inferior cerebellum gray
  filter(ROI=="INFERIOR_CEREBGM") %>%
  ggplot(data=., mapping=aes(x=SUVR)) +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ylab("Number of Occurences") +
  xlab("Inferior Cerebellum Gray SUVR") +
  ggtitle("Distribution of Inferior Cerebellum Gray Matter Tau Uptake") +
  theme(plot.title=element_text(hjust=0.5))

# Convert to interactive plotly histogram
ggplotly(p_inf_cb)
rm(p_inf_cb)

Most of the inferior cerebellum gray ROIs actually have SUVRs around 1.25. I’ll re-normalize all ROI values to the mean of this distribution in Data Preparation.

Subject demographics + cognitive assessments

Data overview

The longitudinal subject demographic and cognitive assessment dataset was downloaded as a CSV from the ADNI Study Data repository, located at Study Data/Study Info/Key ADNI tables merged into one table. This CSV file contains 14,816 rows and 113 columns. This includes many subjects with multiple follow-up visits. Columns include (anonymized) subject demographic information such as age, sex, race, and marriage status. Note: ADNI data is freely accessible to all registered users. Please see my Acknowledgments page for more information about ADNI and its contributors.

This project will one key target feature in this dataset: Clinical Dementia Rating (CDR) Sum of Boxes. The CDR scale was initially developed in 1982 at the Washington University as a metric of clinical dementia progression (Hughes et al., 1982). This cognitive test measures impairment in six cognitive domains: memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care (Morris 1991). Each of these categories is scored on a three-point scale as follows:

  • 0 = no impairment
  • 0.5 = questionable
  • 1 = mild dementia
  • 2 = moderate dementia
  • 3 = severe dementia

The global score is most heavily influenced by the memory score, though there is an established decision tree-like algorithm for how to calculate the global score. By contrast, the CDR Sum of Boxes reflects the sum of scores for each of the six domains, with an overall range of 0 (no impairment) to 18 (severe impairment). The CDR Sum of Boxes is an extension upon the CDR global score, offering a more detailed quantitative score. This metric reportedly improves precision in longitudinal cognitive tracking, particularly in cases of mild dementia (O’Bryant et al., 2008). Of note, the CDR assessment shows high inter-rater reliability, which is important given the inter-site nature of the ADNI cohort (Morris 1991).

Sources:

  • Hughes, C. P., Berg, L., Danziger, W., Coben, L. A., & Martin, R. L. (1982). A new clinical scale for the staging of dementia. The British journal of psychiatry, 140(6), 566-572.
  • Morris, J. C. (1991). The clinical dementia rating (CDR): Current version and scoring rules. Neurology, 43(11), 1588-1592.
  • O’Bryant, S. E., Waring, S. C., Cullum, C. M., Hall, J., Lacritz, L., Massman, P. J., … & Doody, R. (2008). Staging dementia using Clinical Dementia Rating Scale Sum of Boxes scores: a Texas Alzheimer’s research consortium study. Archives of neurology, 65(8), 1091-1095.

Data loading

ADNI compiled a merged dataset containing key information from several tables, including subject demographics, selected cognitive assessment scores, and select biomarker data.

# Read in subject demographic dataset from ADNI
subj.info <- read.csv("../../ADNI_Data/Raw_Data/ADNIMERGE.csv", stringsAsFactors = T, na.strings="")

Note: as with the tau-PET data, I will simulate the data I downloaded from ADNI so that interested readers can follow along, using the “Simulated_ADNI_cognitive_scores.csv” file in the GitHub repo main directory:

set.seed(127)
# subset data to be simulated:
data.to.sim <- select(subj.info, c(RID, VISCODE, EXAMDATE, AGE, PTGENDER, CDRSB))

# use simulate_dataset from fakeR
# stealth.level=2 means each column is simulated independently
subj.sim <- simulate_dataset(data.to.sim, digits=6, stealth.level=2, ignore=c("RID", "VISCODE", "EXAMDATE"))
write.csv(subj.sim, "../Simulated_ADNI_cognitive_scores.csv", row.names = F)

The simulated cognitive scores dataset can be loaded as follows:

subj.info <- read.csv("../Simulated_ADNI_cognitive_scores.csv")

Note: I am using the real ADNI data, so results will vary from those obtained with the simulated data.

The cognitive score dataset columns I will be using for this project include:

  • RID: Participant roster ID, which serves as unique subject identifier
  • VISCODE: Visit code
  • EXAMDATE: Date
  • AGE: Age at visit
  • PTGENDER: Biological sex
  • CDRSB: CDR Sum-of-Boxes score at visit
subj.info$EXAMDATE <- as.Date(subj.info$EXAMDATE, format="%m/%d/%Y")
summary(subj.info %>% select(RID, VISCODE, EXAMDATE, AGE, PTGENDER, CDRSB))
##       RID          VISCODE        EXAMDATE               AGE          PTGENDER        CDRSB       
##  Min.   :   2   bl     :2272   Min.   :2005-09-07   Min.   :54.40   Female:6593   Min.   : 0.000  
##  1st Qu.: 685   m12    :1836   1st Qu.:2008-12-23   1st Qu.:69.10   Male  :8223   1st Qu.: 0.000  
##  Median :2074   m06    :1618   Median :2012-06-19   Median :73.50                 Median : 1.000  
##  Mean   :2613   m24    :1451   Mean   :2012-05-02   Mean   :73.48                 Mean   : 2.083  
##  3rd Qu.:4513   m18    :1292   3rd Qu.:2014-03-31   3rd Qu.:78.30                 3rd Qu.: 3.000  
##  Max.   :6874   m36    : 857   Max.   :2020-07-27   Max.   :91.40                 Max.   :18.000  
##                 (Other):5490                        NA's   :4                     NA's   :4172

There is a lot of missing data in this dataset – however, this dataset includes many subjects and visit dates that don’t correspond to tau-PET scans, and therefore won’t be used in this analysis. Missingness will be re-evaluated once the PET data and subject demographic data is merged in Data Preparation.

Data distribution

The time distribution of ADNI cognitive assessment data can be visualized:
# plotly histogram of cognitive assessment dates
subj.info %>%
  select(RID, EXAMDATE) %>%
  plot_ly(x=~EXAMDATE, type="histogram",
          marker = list(color = "lightsteelblue",
                        line = list(color = "lightslategray",
                                    width = 1.5))) %>%
  layout(title = 'Subject Demographics Date Distribution',
         xaxis = list(title = 'Visit Date',
                      zeroline = TRUE),
         yaxis = list(title = 'Number of Subjects')) 

One thing to note is that tau-PET was only incorporated into the ADNI pipeline in late 2015/early 2016, so any demographic information from pre-2016 will not be included in modeling and analysis. Let’s check how many subjects had more than one visit recorded in this dataset:

# visualize distribution of # cognitive assessments per subject
p_subj_long <- subj.info %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_exams=n()) %>%
  # fct_reorder --> arrange subject on x-axis by number of exams, from large to small
  ggplot(., aes(x=fct_reorder(RID, n_exams, .desc=T), y=n_exams, label=RID)) +
  geom_bar(stat="identity", aes(fill=n_exams, color=n_exams)) +
  labs(fill="Count", color="Count") +
  ggtitle("Number of ADNI Visits per Subject") +
  ylab("Number of Visits") +
  xlab("Subjects") +
  theme(axis.text.x=element_blank(),
        plot.title=element_text(hjust=0.5)) 

# convert to interactive plotly histogram
ggplotly(p_subj_long, tooltip = c("y"))
rm(p_subj_long)

Unlike with the PET data, most subjects have two or more visits recorded with cognitive and demographic information. The subjects in this dataset have up to 21 CDR-Sum of Boxes scores. To examine precisely how many subjects have at least two visits recorded:

num.subj <- subj.info %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  summarise(num_subjects=n(),
            total_exams=sum(n_exams))
cat("Number of subjects with at least two ADNI visits: **", 
    num.subj$num_subjects, "**\n", 
    "\nTotal number of longitudinal ADNI visits recorded: **", 
    num.subj$total_exams, "**\n", sep="")

Number of subjects with at least two ADNI visits: 2027

Total number of longitudinal ADNI visits recorded: 14571

There are 2027 subjects with two or more ADNI visits in this dataset. This should include all subjects and visit dates included in the tau-PET dataset, which will be confirmed upon merging the datasets in the Data Preparation stage.

Temporal distribution

It’s also worth checking the distribution of time interval between ADNI visits in these 2027 subjects with two or more visits:

# Plot # years between consecutive cognitive assessments by subject
p.subj.interval <- subj.info %>%
  select(RID, EXAMDATE) %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  arrange(EXAMDATE) %>%
  # Use lag to calculate time interval between exams
  mutate(Years_between_ADNI = 
           as.numeric((EXAMDATE - lag(EXAMDATE, 
                                      default = EXAMDATE[1]))/365)) %>%
  # Omit first scan per subject, for which the time interval is zero
  filter(Years_between_ADNI>0) %>%
  ggplot(., aes(x=Years_between_ADNI)) +
  geom_histogram(stat="count", fill="lightsteelblue", color="lightslategray") +
  ggtitle("Years in between ADNI visits per Subject") +
  ylab("Frequency") +
  xlab("# Years between two consecutive ADNI visits") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5)) 

# Convert to plotly interactive histogram
ggplotly(p.subj.interval)
rm(p.subj.interval)

Interestingly, there is a clear peak around 0.5 (six months) and a smaller peak around 1 (one year), indicating that most visits were spaced between 6 and 12 months apart. However, there is a positive skew to this distribution showing that a portion of subjects went up to five years in between visits. This is not likely to affect my analysis, as most tau-PET subjects had PET scans from 2018 onward, and would therefore have a follow-up interval of two years or less.

CDR-Sum of Boxes Scores

Now, I’ll look into the distribution of the target variable (CDRSB) and how they relate to other covariates, namely age and sex. These visualizations are filtered to show only those subjects with 2+ assessments.

# Plot CDR-SoB scores distribution across subjects with 2+ assessments
p.cdr.scores <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=CDRSB)) +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ylab("# of Occurences") +
  xlab("CDR-Sum of Boxes") +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes Distribution") +
  theme(plot.title=element_text(hjust=0.5))

# Convert to plotly interactive histogram
ggplotly(p.cdr.scores)
rm(p.cdr.scores)

CDR-SoB by Age + Sex

Now, to stratify by age and sex, respectively:

# Violin plot by sex for CDR-SoB
p.cdr.sex.violin <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=PTGENDER, y=CDRSB)) +
  geom_violin(aes(fill=PTGENDER)) +
  theme_minimal() +
  ylab("CDR Sum of Boxes Score") +
  xlab("Biological Sex") +
  geom_signif(map_signif_level = F,
              test="t.test",
              comparisons=list(c("Female", "Male"))) +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes by Sex") +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")

# CDR-SoB histogram by sex
p.cdr.sex.hist <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=CDRSB)) +
  geom_histogram(aes(y=..count.., fill=PTGENDER)) +
  facet_wrap(PTGENDER ~ ., ncol=1) +
  theme_minimal() +
  ylab("Number of Subjects") +
  xlab("CDR Sum of Boxes Score") +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes Distribution by Sex") +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")


# Convert plots to interactive plotly visualizations
p.cdr.sex.violin <- ggplotly(p.cdr.sex.violin)
p.cdr.sex.hist <- ggplotly(p.cdr.sex.hist)

# Create plotly layout using subplot, keep x and y axes distinct
plotly::subplot(p.cdr.sex.violin, p.cdr.sex.hist, shareX=F, shareY=F,
                titleX=T, titleY=T) %>% 
  # Manually supply x- and y-axis titles
  layout(xaxis = list(title = "Biological Sex", 
                      titlefont = list(size = 12)), 
         xaxis2 = list(title = "CDR Sum of Boxes Score", 
                       titlefont = list(size = 12)),
         yaxis=list(title="CDR Sum of Boxes Score", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)),
         yaxis3 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)))

rm(p.cdr.sex.hist, p.cdr.sex.violin)

The distribution of CDR Sum of Boxes score looks similar between Females and Males by eye, but I’ll follow up with a t-test to confirm:

# t-test for CDR-SoB by sex
t.test.df <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() 
t.test(CDRSB ~ PTGENDER, data=t.test.df)
rm(t.test.df)
## 
##  Welch Two Sample t-test
## 
## data:  CDRSB by PTGENDER
## t = -2.804, df = 9542.1, p-value = 0.005057
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.27228130 -0.04822416
## sample estimates:
## mean in group Female   mean in group Male 
##             2.000650             2.160903

In fact, there is actually a statistically significant (p<0.05) difference in CDR Sum of Boxes scores between males and females, with male subjects exhibiting an average score ~0.15 points higher than female subjects. This is an important consideration, and I will be sure to include sex as a covariate in prediction models where applicable.

Similarly, I’ll compare CDR Sum of Boxes with age:

# Plot CDR-SoB by age in scatter plot, only for subjects with 2+ cognitive assessments
p.age.cdr <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  ggplot(data=., mapping=aes(x=AGE, y=CDRSB)) +
  labs(color="Number of Points") +
  xlab("Age at Visit") +
  ylab("CDR Sum of Boxes") +
  ggtitle("CDR Sum of Boxes vs. Age at Visit") +
  geom_point(size=1, alpha=0.2, color="lightslategray", fill="lightslategray") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")

# Plot histogram of age distribution for subjects with 2+ cognitive assessments
p.age.dist <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  ggplot(data=., mapping=aes(x=AGE)) +
  xlab("Number of Occurrences") +
  ylab("Age at Visit") +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ggtitle("CDR Sum of Boxes vs. Age at Visit") +
  theme(plot.title=element_text(hjust=0.5))

p.age.cdr <- ggplotly(p.age.cdr)
p.age.dist <- ggplotly(p.age.dist)
# Create plotly layout using subplot, keep x and y axes distinct
plotly::subplot(p.age.cdr, p.age.dist, shareX=F, shareY=F,
                titleX=T, titleY=T, margin = 0.05) %>% 
  # Manually supply x- and y-axis titles
  layout(xaxis = list(title = "Biological Sex", 
                      titlefont = list(size = 12)), 
         xaxis2 = list(title = "Age at Visit", 
                       titlefont = list(size = 12)),
         yaxis=list(title="CDR Sum of Boxes Score", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)),
         autosize = F, width = 800, height = 500)

rm(p.age.cdr, p.age.dist)

While there doesn’t appear to be any clear linear association between age at visit and CDR Sum of Boxes score, I’ll use cor.test to calculate the Pearson correlation coefficient and the corresponding p-value based on the correlation coefficient t-statistic:

# Correlation test between age and CDR-SoB for subjects with 2+ cognitive assessments
cor.test.df <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() 
cor.test(cor.test.df$AGE, cor.test.df$CDRSB)
## 
##  Pearson's product-moment correlation
## 
## data:  cor.test.df$AGE and cor.test.df$CDRSB
## t = 6.8375, df = 10395, p-value = 8.512e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04775209 0.08602444
## sample estimates:
##        cor 
## 0.06691288

Interestingly, this correlation coefficient is statistically significant (p=8.512e-12), but the effect size is very small (r=0.067). This significance seems to reflect the sheer size of the dataset rather than a strong relationship between age and CDR sum of boxes scores. Nevertheless, I will explore the use of age as a covariate in modeling later as this is a common practice.

Outlier detection

To better identify outliers based on this multivariate dataset, I will calculate Cook’s distance for each data point once the tau-PET data is merged with the cognitive status data in the Data Preparation stage.

Previous page: Business Understanding Next page: Data Preparation